Data

Where do these files come from?

The raw and derived data files are available in the project's GitHub repository.

WB Projects & Operations

World Bank Projects & Operations data were obtained from:

The Accessibility Classification is public, under Creative Commons Attribution 4.0 (CC BY 4.0).

Process to ingest & preprocess raw PDO text data

  1. Retrieve all listed WB projects (22,571; approval obtained or requested between FY 1947 and FY 2026, as of 31/08/2024) using the Excel button on this page: WBG Projects
  2. Split the dataset and keep only projs_train (~50% of projects with PDO text, i.e. 5,637 PDOs)
  3. Clean the projs_train dataset
  4. Further process the pdo column
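
Step 2 can be sketched as follows. This is a minimal illustration in Python, not the project's own code (the files listed on this page suggest the real split is done in analysis/01a_WB_project_pdo_prep.qmd). The function name `split_projects`, the seed, the placeholder project IDs, and the 50/25/25 proportions are assumptions, with the proportions inferred from the approximate train/test/validation counts reported on this page.

```python
import random


def split_projects(project_ids, seed=42, train_frac=0.50, test_frac=0.25):
    """Shuffle project IDs reproducibly and split them into train/test/validation.

    Hypothetical helper: the fractions and the seed are assumptions,
    not values taken from the project's scripts.
    """
    ids = list(project_ids)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(ids)

    n_train = round(len(ids) * train_frac)
    n_test = round(len(ids) * test_frac)

    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]  # whatever remains goes to validation
    return train, test, val


# Placeholder IDs standing in for projects that have PDO text
ids = [f"P{i:06d}" for i in range(1000)]
train, test, val = split_projects(ids)
print(len(train), len(test), len(val))
```

Fixing the seed keeps the three subsets stable across runs, which matters here because later steps reuse the saved training subset.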

Input data files

The files in the folder data/raw_data/ were downloaded from the World Bank website.

List of Source Files and Retrieval Dates

| Source File Name | Details | Retrieved |
|---|---|---|
| project2/all_projects_as_of29ago2024.xls | 22,571 obs (projects) | 29 August 2024 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet World Bank Projects) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Themes) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Sectors) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet GEOLocations) | 22,210 obs (projects) | 31 March 2025 |
| project3/all_projects_as_of31mar2025.xlsx (Sheet Financers) | 22,210 obs (projects) | 31 March 2025 |
| wdr.rds | 44 obs (WDRs) | from 2022, then completed manually |

Output data files

The files in the folder data/derived_data/ are created by different scripts and saved there for reuse in other scripts.

List of Intermediate `.rds` Files

| `.rds` File Name | Source File Name | Details |
|---|---|---|
| traking.rds | analysis/01a_WB_project_pdo_prep.qmd | recap of missing elements |
| traking_k.rds | analysis/01a_WB_project_pdo_prep.qmd | recap of missing elements (kable tbl) |
| all_proj_t.rds | analysis/01a_WB_project_pdo_prep.qmd | 11,279 obs (projects) |
| projs_train.rds | analysis/01a_WB_project_pdo_prep.qmd | 5,637 obs (projects); 4,425 if < 2001 FY |
| projs_test.rds | analysis/01a_WB_project_pdo_prep.qmd | 2,821 obs (projects) |
| projs_val.rds | analysis/01a_WB_project_pdo_prep.qmd | 2,820 obs (projects) |
| pdo_train_to_tag.rds | analysis/01a_WB_project_pdo_prep.qmd | 5,637 obs; intermediate step (input); post split |
| pdo_train_tagged.rds | analysis/01a_WB_project_pdo_prep.qmd | large cnlp object; intermediate step (output); post split |
| pdo_train_t.rds | analysis/01a_WB_project_pdo_prep.qmd | 314,821 obs (tokens); 248,256 if < 2001 FY; post split |
| projs_train2.rds | analysis/01b_WB_project_pdo_EDA.qmd | 4,425 obs (projects); changed |
| custom_stop_words.rds | analysis/01b_WB_project_pdo_EDA.qmd | as vector |
| custom_stop_words_df.rds | analysis/01b_WB_project_pdo_EDA.qmd | as df |
| wdr.rds | imported from OLD repo ~/Github/slogan_old/: (1) slogan_old/_my_stuff/WDR-data-ingestion.Rmd, result saved as slogan_old/data/raw_data/WDR.rds (problem: the API changed, so this step is no longer reproducible); (2) slogan_old/01b_WDR_data-exploration_abstracts.Rmd, result saved as slogan_old/data/raw_data/wdr.rds | as df (44); WDR abstracts processed ~ like PDOs |
| wdr2.rds | analysis/01b_WB_project_pdo_EDA.qmd (added WDR 2023/2024 manually) | as df (46) |